NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection by toptobes · Pull Request #462 · apache/cassandra-nodejs-driver

toptobes · 2026-05-20T00:57:14Z

This PR supersedes Jane He's #430, with her blessing.

After some investigation, we were unable to pin down the root cause of the NPEs; there are several paths the issue could have come from. So we decided to fix it at the lowest and simplest level we could: adding stronger concurrency control to _refresh directly via a _refreshInProgress flag.

I personally believe the issue stemmed from _setHealthListeners being called multiple times on the same host/connection, causing the listeners to trigger refreshes multiple times for the same event, leading to the NPEs mentioned in the ticket.

However, the issue is quite hard to reproduce organically, so the theory remains a theory.

Potential trace:

1. _refresh() is called
2. _refresh() calls _refreshControlConnection()
3. _refreshControlConnection() fails to borrow a connection, so it calls _initializeConnection()
4. _initializeConnection() calls _setHealthListeners()
5. _refresh() gets back in control and then also calls _setHealthListeners()

which means that there's the potential of, sequentially:

1. A new host and connection being set (call them H1 and C1)
2. Listeners being attached to H1 and C1
3. A newer host being set (call it H2)
4. Listeners being attached to H2 and C1 without the previous listeners being removed
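The effect of that sequence can be sketched in isolation. This is a minimal illustration, not the driver's code: `FakeControlConnection`, its `setHealthListeners` method, and the `down` / `socketClose` event names are stand-ins assumed for the example. It only shows that attaching listeners to the same connection twice turns one event into two refreshes.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for ControlConnection; only listener bookkeeping is modelled.
class FakeControlConnection {
  constructor() {
    this.refreshCalls = 0;
  }

  // Mirrors the shape of _setHealthListeners(host, connection), not its real body.
  setHealthListeners(host, connection) {
    host.once('down', () => this.refresh());
    connection.once('socketClose', () => this.refresh());
  }

  // Stand-in for _refresh(): just counts how often it is triggered.
  refresh() {
    this.refreshCalls++;
  }
}

const cc = new FakeControlConnection();
const h1 = new EventEmitter();
const h2 = new EventEmitter();
const c1 = new EventEmitter();

cc.setHealthListeners(h1, c1); // attached on the _initializeConnection() path
cc.setHealthListeners(h2, c1); // attached again once _refresh() regains control

c1.emit('socketClose'); // a single close event on C1
console.log(cc.refreshCalls); // 2
```

Without a guard, each of those refreshes could then reset the host and connection while the other is still running.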

toptobes · 2026-05-20T00:59:35Z

      assert.strictEqual(cc.hosts.length, 1);
    });

+    it('should not break when refreshing concurrently', async () => {


may need a better heuristic for ensuring the refreshes are okay... not sure... I just "borrowed" this from Jane's original PR

Copilot

Pull request overview

This PR addresses NODEJS-681 by adding concurrency protection around ControlConnection._refresh() to avoid concurrent refresh executions that can lead to inconsistent .host / .connection state, and adds an integration test intended to exercise concurrent refresh behavior.

Changes:

Add _refreshInProgress guard and refactor refresh logic into _unsafeDoRefresh() in ControlConnection.
Add an integration test that triggers many _refresh() calls.
Make promiseUtils.toBackground() tolerate undefined/null inputs via optional chaining.
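As a rough sketch of what such a guard can look like (an assumption-laden illustration, not the code in this PR): `GuardedRefresher` is a hypothetical class, and skipping overlapping callers is an assumed policy; the real change may instead queue them or share one promise. The `toBackground` body is likewise only the assumed shape of a nullish-tolerant helper.

```javascript
// Hypothetical sketch of a refresh-in-progress guard; not the driver's actual code.
class GuardedRefresher {
  constructor(unsafeDoRefresh) {
    this._unsafeDoRefresh = unsafeDoRefresh; // the unguarded refresh body
    this._refreshInProgress = false;
    this.runs = 0;
  }

  async refresh() {
    if (this._refreshInProgress) {
      return false; // assumption: overlapping callers are skipped
    }
    this._refreshInProgress = true;
    try {
      this.runs++;
      await this._unsafeDoRefresh();
      return true;
    } finally {
      // Always release the flag, even when the refresh throws.
      this._refreshInProgress = false;
    }
  }
}

// Assumed shape of the helper: ignore rejections and tolerate undefined/null input.
function toBackground(promise) {
  promise?.catch(() => {});
}

const refresher = new GuardedRefresher(
  () => new Promise((resolve) => setImmediate(resolve))
);

// Three overlapping calls; only the first one runs the refresh body.
const results = Promise.all([
  refresher.refresh(),
  refresher.refresh(),
  refresher.refresh()
]);
results.then((r) => console.log(r, refresher.runs));

toBackground(undefined); // does not throw
```

The `finally` block matters here: if the flag were only cleared on success, one failed refresh would block every later one.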

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lib/control-connection.js | Adds a refresh-in-progress guard and refactors refresh implementation into a separate method. |
| lib/promise-utils.js | Makes `toBackground()` no-op safely when given a nullish value. |
| test/integration/short/control-connection-tests.js | Adds a concurrency-focused integration test for control connection refresh. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

SiyaoIsHiding

I think the main reason it broke in the past is that this.host and this.connection are nullable (they are supposed to be null in certain circumstances, such as during a refresh), but calls like this._setHealthListeners(this.host, this.connection); do not treat them as nullable. If this were TypeScript, it would complain that _setHealthListeners expects Host, Connection as arguments while this call passes Host?, Connection?.
So, aside from the _refreshInProgress flag that ensures only one refresh happens at a time, I think we should still treat this.host and this.connection as nullable.
That means this._setHealthListeners(this.host, this.connection); should be

if (this.host && this.connection) {
  this._setHealthListeners(this.host, this.connection);
}

Other places that access this.host and this.connection should also have null guards like this.
What do you think?

toptobes · 2026-05-20T19:56:12Z

As far as I can tell, the nullability is meant to be a transient state, and imposing these silent null checks may cause more subtle bugs than it solves. I wouldn't be against explicit null checks in _setHealthListeners as an assertion, but silently skipping the listeners could be problematic.

In an ideal world we may want to use an explicit state machine, but that seems well out of scope for this PR.

SiyaoIsHiding · 2026-05-20T22:50:21Z

Why "silent null checks may just cause more subtle bugs"?
I think they can potentially fix some bugs that we didn't discover yet.
But either way, I'm good. I think this PR can already fix the problem that we clearly know of.
If you fix that eslint error, I will give the explicit approval.

toptobes · 2026-05-21T00:16:24Z

Just to step back for a second: we need to decide whether it's even valid to get to the end of _refresh() without the host or the connection being set.

I don't think it should be, no?

If it does happen, a silent check just lets the bug propagate instead of catching it and telling the user loudly that it happened and needs to be patched (and with this refresh fix it should never happen, unless there's a bigger issue).

SiyaoIsHiding · 2026-05-21T00:28:51Z

No, it shouldn't reach the end of _refresh without this.host set.
You convinced me 👍

SiyaoIsHiding · 2026-05-26T22:24:54Z

This test failure is new and it concerns me.

  1) ControlConnection
       #init()
         should subscribe to SCHEMA_CHANGE events and refresh keyspace information:
     Error: Condition still false after 100 attempts: () => cc.metadata.keyspaces['sample_change_1'].strategyOptions.replication_factor === '2'
      at whilstItem (test/test-helper.js:769:23)
      at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
      at listOnTimeout (node:internal/timers:585:17)
      at process.processTimers (node:internal/timers:521:7)

After some investigation, I think the problem is that a schema refresh triggered by an event can be permanently lost if the CC's connection is down at that moment and a new connection cannot be established within 2 seconds. This singleton refresh implementation makes the driver more vulnerable to that than before.

What might have happened:

In CI, a brief TCP hiccup (or Cassandra-side connection reset) closed the CC's socket during the 100ms debounce window after the ALTER KEYSPACE event:

1. ALTER KEYSPACE fires → SCHEMA_CHANGE event → EventDebouncer starts 100ms timer
2. TCP socket closes → socketClose → _refresh() acquires _refreshInProgress, sets this.connection = null
3. Debouncer fires → metadata.refreshKeyspace → cc.query() → connection === null → _waitForReconnection() (2s timeout)
4. _refresh() fails or does not establish a new connection within 2 seconds → _waitForReconnection rejects
5. Error is swallowed by toBackground(): the ALTER KEYSPACE schema update is silently dropped
6. CC eventually reconnects but no new SCHEMA_CHANGE event arrives → keyspace stays at replication_factor=3 → poll times out

In the past, when we accidentally allowed concurrent _refresh() calls, if one concurrent attempt succeeded while another failed, the newConnection(null) event would resolve the pending _waitForReconnection before the error could reject it. The new singleton approach eliminates that accidental rescue path.
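The suspected sequence can be reproduced in miniature. This is only a simulation under stated assumptions: `FakeControlConnection`, `waitForReconnection` and `refreshKeyspace` are hypothetical stand-ins, the `toBackground` body is an assumed shape, and the 2-second timeout is shortened to 10 ms so the script finishes quickly.

```javascript
const { EventEmitter } = require('events');

// Assumed shape of the helper: run a promise in the background, ignore rejections.
function toBackground(promise) {
  promise?.catch(() => {});
}

// Hypothetical stand-in for the control connection; timings are shortened.
class FakeControlConnection extends EventEmitter {
  constructor(reconnectTimeoutMs) {
    super();
    this.reconnectTimeoutMs = reconnectTimeoutMs;
    this.connection = null; // the socket closed during the debounce window
    this.keyspaces = { sample_change_1: { replication_factor: '3' } };
  }

  // Rejects if no 'newConnection' event arrives before the timeout.
  waitForReconnection() {
    return new Promise((resolve, reject) => {
      const onConnection = () => {
        clearTimeout(timer);
        resolve();
      };
      const timer = setTimeout(() => {
        this.removeListener('newConnection', onConnection);
        reject(new Error('Timed out waiting for reconnection'));
      }, this.reconnectTimeoutMs);
      this.once('newConnection', onConnection);
    });
  }

  // Stand-in for a keyspace refresh: needs a live connection to apply the update.
  async refreshKeyspace(name, latest) {
    if (!this.connection) {
      await this.waitForReconnection();
    }
    this.keyspaces[name] = latest;
  }
}

const cc = new FakeControlConnection(10);

// The debouncer fires while disconnected; the eventual rejection is swallowed.
toBackground(cc.refreshKeyspace('sample_change_1', { replication_factor: '2' }));

// The reconnection lands only after the wait has already timed out.
const finished = new Promise((resolve) => {
  setTimeout(() => {
    cc.connection = {};
    cc.emit('newConnection');
    resolve(cc.keyspaces.sample_change_1.replication_factor);
  }, 50);
});

finished.then((rf) => console.log('replication_factor after reconnect:', rf));
```

In this simulation the metadata still reports the old replication factor after the reconnection, because nothing retries the refresh that was dropped.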

Fix

Instead of allowing concurrent refreshes, I think we should fix the problem of schema refresh errors being swallowed. For example, in _nodeSchemaChangeHandler in control-connection.js:

  // Instead of: toBackground(this.handleSchemaChange(event, false))
  this.handleSchemaChange(event, false).catch(() => {
    // CC will reconnect; re-queue this event once it does
    this.once('newConnection', (err) => {
      if (!err) toBackground(this.handleSchemaChange(event, false));
    });
  });

SiyaoIsHiding

Above

SiyaoIsHiding · 2026-05-27T20:44:35Z

Bret points out that the Java driver's behavior is that every time a new control connection is established, it refreshes schema

https://github.com/apache/cassandra-java-driver/blob/90b09e0e34fd6fdda064f13f6acd0df7268a3dc6/core/src/main/java/com/datastax/oss/driver/internal/core/control/ControlConnection.java#L460-L469

It makes sense to me for the Node.js driver to align with the Java driver's behavior.
That would be an easier fix, too.

diff --git a/lib/control-connection.js b/lib/control-connection.js
index 5abf8b06..4199281c 100644
--- a/lib/control-connection.js
+++ b/lib/control-connection.js
@@ -415,6 +415,8 @@ class ControlConnection extends events.EventEmitter {
 
       await this.metadata.refreshKeyspacesInternal(false);
       this.metadata.initialized = true;
+    } else if (this.options.isMetadataSyncEnabled) {
+      await this.metadata.refreshKeyspacesInternal(false);
     }
   }

toptobes · 2026-05-28T22:18:55Z

After some deliberation, we think we can just call this.metadata.refreshKeyspaces[Internal?]() within _refreshHosts unconditionally, as that function appears to invalidate all such schema caches. Need to triple-check that this is the right solution, though.

SiyaoIsHiding · 2026-06-04T00:28:48Z

Confirming the above is correct.
The question is whether refreshKeyspacesInternal is enough to cover any events that may have been lost.
There are 3 kinds of events:

1. TOPOLOGY_CHANGE, handled by refreshHosts
2. STATUS_CHANGE (a host going UP or DOWN), handled by host.js's reconnection schedule, independent of ControlConnection
3. SCHEMA_CHANGE
   3a. refreshKeyspacesInternal re-runs SELECT * FROM system_schema.keyspaces
   3b. It also clears the caches for UDTs, tables, functions, aggregates, and views. They are lazy-loaded when used for the first time.

So I think refreshKeyspacesInternal is enough.
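A small simulation of that reasoning, under loud assumptions: `FakeCluster`, `FakeMetadata` and `onControlConnectionEstablished` are hypothetical stand-ins, and the method body only imitates what the discussion says refreshKeyspacesInternal does (re-read the keyspaces, drop dependent caches). It is not the driver's implementation.

```javascript
// Hypothetical in-memory stand-in for the database's schema tables.
class FakeCluster {
  constructor() {
    this.keyspaces = { sample_change_1: { replication_factor: '3' } };
  }
}

// Hypothetical stand-in for the driver's metadata object.
class FakeMetadata {
  constructor(cluster) {
    this.cluster = cluster;
    this.keyspaces = {};
    this.tableCache = new Map(); // lazily filled on first use
  }

  // Imitates re-reading system_schema.keyspaces and clearing dependent caches.
  async refreshKeyspacesInternal() {
    this.keyspaces = JSON.parse(JSON.stringify(this.cluster.keyspaces));
    this.tableCache.clear();
  }
}

// Assumed hook: runs every time a control connection is (re)established.
async function onControlConnectionEstablished(metadata) {
  await metadata.refreshKeyspacesInternal();
}

async function demo() {
  const cluster = new FakeCluster();
  const metadata = new FakeMetadata(cluster);

  await onControlConnectionEstablished(metadata); // initial connection
  metadata.tableCache.set('sample_change_1.t', { columns: ['id'] });

  // ALTER KEYSPACE runs while disconnected, so its SCHEMA_CHANGE event is never seen.
  cluster.keyspaces.sample_change_1.replication_factor = '2';

  await onControlConnectionEstablished(metadata); // reconnection
  return {
    rf: metadata.keyspaces.sample_change_1.replication_factor,
    cachedTables: metadata.tableCache.size
  };
}

const result = demo();
result.then((r) => console.log(r));
```

The trade-off is one extra keyspace query per reconnection in exchange for not depending on every SCHEMA_CHANGE event being delivered.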

SiyaoIsHiding

This test still fails

1) ControlConnection
       #init()
         should subscribe to SCHEMA_CHANGE events and refresh keyspace information:
     Error: Condition still false after 100 attempts: () => !cc.metadata.keyspaces['sample_change_1']
      at whilstItem (test/test-helper.js:769:23)
      at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
      at listOnTimeout (node:internal/timers:605:17)
      at process.processTimers (node:internal/timers:541:7)

I think the following sequence is possible:

1. While the driver is selecting the sample_change_1 keyspace info, the keyspace still exists in the DB, so the DB sends the keyspace info in its response, but the response has not reached the driver yet.
2. The DROP statement executes.
3. The driver deletes metadata.keyspaces['sample_change_1'].
4. The sample_change_1 keyspace info finally reaches the driver.
5. The driver assigns cc.metadata.keyspaces['sample_change_1'].

I need to think about how to fix this, but I think this path can make the test fail.
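The suspected interleaving can be made deterministic in a few lines. This is only a model of the ordering problem: the `metadata` object and the manually resolved promise are hypothetical stand-ins for the driver's metadata and an in-flight keyspace query, and no fix is implied.

```javascript
// Hypothetical metadata store; only the ordering of the two updates matters here.
const metadata = {
  keyspaces: { sample_change_1: { replication_factor: '2' } }
};

// An in-flight keyspace query whose response can be delivered at a chosen moment.
let deliverResponse;
const inFlightRefresh = new Promise((resolve) => {
  deliverResponse = resolve;
}).then((row) => {
  // The late response is applied without checking whether it is still current.
  metadata.keyspaces[row.name] = { replication_factor: row.replication_factor };
});

// The DROP is processed first, so the driver removes the keyspace.
delete metadata.keyspaces.sample_change_1;

// The response that was sent before the DROP finally arrives.
deliverResponse({ name: 'sample_change_1', replication_factor: '2' });

const resurrected = inFlightRefresh.then(
  () => 'sample_change_1' in metadata.keyspaces
);
resurrected.then((value) => console.log('keyspace present after DROP:', value));
```

In this model the dropped keyspace reappears, which matches the condition the failing test keeps polling for.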

SiyaoIsHiding · 2026-06-05T02:50:50Z

+      // Don't attempt to reconnect when the ControlConnection is being shutdown
      if (self._isShuttingDown) {
-        // Don't attempt to reconnect when the ControlConnection is being shutdown
+        this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');


Suggested change

-        this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');
+        self.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown');

SiyaoIsHiding · 2026-06-05T03:26:55Z

-      this.metadata.initialized = true;
    }
+
+    this.metadata.initialized = true;


nit: this has the side effect that, if metadata sync is not enabled, this.metadata.initialized = true; will still run every time it refreshes, not just once at initialization.

add stronger concurrency control directly in _refresh

0dfe29e

SiyaoIsHiding requested a review from Copilot May 20, 2026 01:07

Copilot started reviewing on behalf of SiyaoIsHiding May 20, 2026 01:08 View session

SiyaoIsHiding requested review from SiyaoIsHiding, absurdfarce and jorgebay May 20, 2026 01:08

Copilot AI reviewed May 20, 2026

Potential fix for pull request finding

37e314f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

SiyaoIsHiding requested changes May 20, 2026

fix eslint issue

91a7b81

toptobes force-pushed the control-connection-refresh branch from 1ca12ed to 91a7b81 Compare May 21, 2026 00:27

toptobes requested a review from SiyaoIsHiding May 21, 2026 01:34

SiyaoIsHiding requested changes May 26, 2026

toptobes added 2 commits June 3, 2026 21:08

address pr comments, including refreshing schema on every reconnection

d7a20fc

remove void

6356808

SiyaoIsHiding reviewed Jun 5, 2026
